Name: N S Ramanujam Mangena
We were given a dataset indicating who availed a term deposit in the previous campaign. Before starting the analysis, as with any data science project, let us establish a basic understanding of the problem statement.
The initial inference is that just 11.7% of the members availed a term deposit, which means the data is highly imbalanced. Our aim is to build a model that predicts which people actually took a term deposit. Hence false negatives are vital to model performance, and we concentrate on recall to measure it.
Moreover, this is a classification problem, as our target column 'Target' is categorical in nature (1/0, True/False). We use ensemble methods to build the predictive models.
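As a quick illustration of why accuracy alone misleads on data this imbalanced (a toy example, not the actual dataset): a model that predicts "no" for everyone scores 88% accuracy while missing every subscriber.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels mimicking ~12% positives; a "model" that predicts all zeros
y_true = [1] * 12 + [0] * 88
y_pred = [0] * 100

# Accuracy looks strong, but recall exposes that every positive was missed
print(accuracy_score(y_true, y_pred))  # 0.88
print(recall_score(y_true, y_pred))    # 0.0
```

This is why recall, not accuracy, is the headline metric for this problem.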
#import basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings("ignore")
bank_df = pd.read_csv("bank-full.csv")
bank_df.shape
bank_df2 = bank_df.copy()
bank_df.head()
sum(bank_df.duplicated())
This suggests there are no duplicate rows.
bank_df.isnull().sum()
bank_df.isna().sum()
This suggests there are no null or NaN values in the dataset.
bank_df.describe()
bank_df.skew().sort_values()
The 'previous' column is highly skewed.
bank_df[bank_df.pdays < 0].pdays.count()
As these are very numerous, we can take the absolute value of the pdays column.
bank_df[bank_df.balance < 0].balance.count()
We have 3,766 members who have used an overdraft (negative balance).
bank_df.info()
Many columns have the object datatype, which cannot be used directly for model building. We will later use LabelEncoder to convert them to a machine-readable format.
bank_df.drop(['duration'],axis =1,inplace = True)
This column is dropped as it would not be useful in building prediction models; the problem statement advised us to ignore it, as it could lead to inappropriate models.
bank_df['pdays'] = bank_df['pdays'].abs()
#Drop duplicate records
bank_df.drop_duplicates(inplace =True)
sum(bank_df.duplicated())
bank_df2[['education','job','marital','default','housing','loan','contact','month','poutcome','Target']].apply(lambda x: x.value_counts()).T.stack()
Note:
As unknown values are given as categorical in problem statement, we are treating them as one of the category. They are not equal to NaN where actual data would be missing.
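A tiny sketch of the distinction: the string 'unknown' is a real category, whereas NaN is genuinely missing data.

```python
import numpy as np
import pandas as pd

# 'unknown' is counted as a category; only NaN registers as missing
s = pd.Series(['primary', 'unknown', np.nan])
print(s.isna().tolist())           # [False, False, True]
print(s.value_counts().to_dict())  # {'primary': 1, 'unknown': 1}
```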
Label Encoding
We will replace the columns listed above with a machine-learning-friendly format. Here we use the label-encoding technique so that we do not create new columns (as in one-hot encoding); the object columns are converted to integers.
#Label Encoder to change the categorical values to numerical values.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in ['job', 'marital', 'education', 'default', 'housing', 'loan',
            'contact', 'month', 'poutcome', 'Target']:
    bank_df[col] = le.fit_transform(bank_df[col])
bank_df.head()
bank_df.info()
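To see the column-count difference between the two encodings, here is a minimal sketch on a hypothetical mini-frame (not the bank data):

```python
import pandas as pd

# Label encoding keeps one integer column per feature;
# one-hot (get_dummies) creates one column per category
df = pd.DataFrame({'marital': ['married', 'single', 'divorced', 'single']})
print(df['marital'].astype('category').cat.codes.tolist())  # [1, 2, 0, 2]
print(pd.get_dummies(df['marital']).shape)                  # (4, 3)
```

One caveat of label encoding is that it imposes an arbitrary ordering on the categories, which tree-based models tolerate better than linear ones.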
Univariate/Bivariate Analysis
It is clear that the variables plotted below (age, balance, day, campaign) are continuous.
The rest are categorical/nominal/binary. We can perform univariate analysis on these parameters.
plt.figure(figsize=(30,5))
# subplot 1
plt.subplot(1, 5, 1)
plt.title('Customer Age Distribution')
sns.distplot(bank_df.age,color='Red')
# subplot 2
plt.subplot(1, 5, 2)
plt.title('Customer balance')
sns.distplot(bank_df.balance,color='green')
# subplot 3
plt.subplot(1, 5, 3)
plt.title('Customer last contacted')
sns.distplot(bank_df.day,color='blue')
# subplot 4 (duration was dropped, so its plot is commented out)
#plt.subplot(1, 5, 4)
#plt.title('Customer last conversation duration')
#sns.distplot(bank_df.duration,color='Orange')
# subplot 4
plt.subplot(1, 5, 4)
plt.title('Customer campaign Distribution')
sns.distplot(bank_df.campaign,color='brown')
plt.figure(figsize=(30,5))
# subplot 1
plt.subplot(1, 5, 1)
plt.title('Customer Age Distribution')
sns.boxplot(bank_df.age,orient='vertical',color='Red')
# subplot 2
plt.subplot(1, 5, 2)
plt.title('Customer balance')
sns.boxplot(bank_df.balance,orient='vertical',color='green')
# subplot 3
plt.subplot(1, 5, 3)
plt.title('Customer last contacted')
sns.boxplot(bank_df.day,orient='vertical',color='blue')
# subplot 4 (duration was dropped, so its plot is commented out)
#plt.subplot(1, 5, 4)
#plt.title('Customer last conversation duration')
#sns.boxplot(bank_df.duration,orient='vertical',color='Orange')
# subplot 4
plt.subplot(1, 5, 4)
plt.title('Customer campaign distribution')
sns.boxplot(bank_df.campaign,orient='vertical',color='brown')
The data is highly skewed and imbalanced. Further inferences follow after profiling.
Alternatively, the pandas-profiling library can be used to analyse each and every attribute in the dataset.
import pandas_profiling
from pandas_profiling import ProfileReport
ProfileReport(bank_df)
The dataset is highly imbalanced, and all the variables appear to be weak learners based on the correlation values. Only 11.7% of the Target values are 'yes'.
plt.title('Target Distribution')
ax=sns.countplot(data = bank_df, x= 'Target',palette='inferno')
for p in ax.patches:
    ax.annotate(str((np.round(p.get_height()/len(bank_df)*100,decimals=2)))+'%', (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
categorcial_variables = ['marital', 'education','default','housing','loan','contact','poutcome','month','job']
plt.figure(figsize=(60,10))
for i in range(len(categorcial_variables)):
    plt.subplot(1, 9, i+1)
    ax=sns.countplot(data = bank_df,palette='inferno', x= categorcial_variables[i])
    for p in ax.patches:
        ax.annotate(str((np.round(p.get_height()/len(bank_df)*100,decimals=2)))+'%', (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.figure(figsize=(60,10))
# avoid reusing the loop variable name for both the range and the patches
for i in range(len(categorcial_variables)):
    plt.subplot(1, 9, i+1)
    ax=sns.countplot(data = bank_df, x= categorcial_variables[i],palette='inferno',hue = 'Target')
    for patch in ax.patches:
        ax.annotate(str((np.round(patch.get_height()/len(bank_df)*100,decimals=2)))+'%', (patch.get_x()+patch.get_width()/2., patch.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.figure(figsize=(10,4))
sns.barplot(bank_df2['job'].value_counts().values, bank_df2['job'].value_counts().index)
plt.title('job')
plt.tight_layout()
Trying to remove the balance outliers with the IQR method is not helpful, as it removes valid data. Hence some outliers are removed using the z-score, beyond a threshold of 3. We remove 745 rows, which is close to 2% and within permissible limits. Moreover, we are not overly concerned with the remaining balance outliers, since the dataset is highly imbalanced and removing further data would hurt the model's predictive power.
#import z statistic library
from scipy import stats
bank_df['balance_z'] = np.abs(stats.zscore(bank_df.balance))
bank_df=bank_df[bank_df.balance_z <3]
bank_df.drop('balance_z', axis = 1, inplace=True)
bank_df.shape
# correlation on the given dataset
corr = bank_df.corr()
corr.style.background_gradient(cmap='coolwarm')
sns.pairplot(bank_df)
plt.figure(figsize=(30, 6))
plt.subplot(1,6,1)
plt.title('Age vs Target')
sns.boxplot(x = 'age', y = 'Target',orient="horizontal",data = bank_df)
plt.subplot(1,6,2)
plt.title('Balance vs Target')
sns.boxplot(x = 'balance', y = 'Target',orient="horizontal", data = bank_df)
In spite of clearing the outliers, we could not notice much improvement, and the correlation matrix above suggests very weak relationships. Though we only have weak learners, we try to make use of them through ensemble learning to improve the predictions.
Positively correlated columns
Negatively correlated columns
Though some columns are negatively or only loosely correlated, we are not removing them, as weak learners still contribute to building decision trees and ensemble methods.
#Train and Test data split library import
from sklearn.model_selection import train_test_split
X = bank_df.drop(['Target'],axis=1) # independent variables
y = bank_df['Target'] # dependent variable
Split the data into a 70:30 train/test ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
print("Original Target True Values : {0} ({1:0.2f}%)".format(len(bank_df.loc[bank_df['Target'] == 1]), (len(bank_df.loc[bank_df['Target'] == 1])/len(bank_df.index)) * 100))
print("Original Target False Values : {0} ({1:0.2f}%)".format(len(bank_df.loc[bank_df['Target'] == 0]), (len(bank_df.loc[bank_df['Target'] == 0])/len(bank_df.index)) * 100))
print("")
print("Training Target True Values : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Training Target False Values : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
print("Test Target True Values : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Test Target False Values : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("")
#Import required metrics libraries
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
#Logistic regression model import
from sklearn.linear_model import LogisticRegression
#importing cross validation and Grid search library
from sklearn.model_selection import GridSearchCV
Hyperparameter tuning with the help of GridSearchCV
grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]} # l1 = lasso, l2 = ridge
model=LogisticRegression(solver="liblinear", max_iter=50000) # liblinear supports both l1 and l2 penalties
model_cv=GridSearchCV(model,grid,cv=10)
model_cv.fit(X_train,y_train)
print("Tuned hpyerparameters :(best parameters) ",model_cv.best_params_)
print("Accuracy :",model_cv.best_score_)
# Fit the model with the help of hyper parameters
model2=LogisticRegression(C=1000,penalty="l2")
#model2=LogisticRegression()
model2.fit(X_train,y_train)
#predict on test
y_pred = model2.predict(X_test)
print('\033[91m' + "Logistic Regression Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))
print('\033[91m' + "Logistic Regression Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])
log_df_cm = pd.DataFrame(mcm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
log_df_cm
#coef_df = pd.DataFrame(model.coef_)
#coef_df['intercept'] = model.intercept_
#print(coef_df)
ROC/AUC Curve
log_roc_auc = roc_auc_score(y_test, model2.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, model2.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % log_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
y_pred = nb_clf.predict(X_test)
print('\033[91m' + "Naive Bayes Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))
print('\033[91m' + "Naive Bayes Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
nv_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])
nv_df_cm = pd.DataFrame(nv_mcm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
nv_df_cm
ROC/AUC Curve
NB_roc_auc = roc_auc_score(y_test,nb_clf.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, nb_clf.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='NB Classifier (area = %0.2f)' % NB_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('NB_ROC')
plt.show()
With KNN we deal with neighbouring data points. There is no specific criterion for choosing the value of k; generally we start with 3, 5, 7 neighbours, and so on. However, the accuracy score decreases when calculated with k = 5 and k = 7, so we have confined the test to k = 3.
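The choice can be checked systematically by sweeping odd values of k and comparing scores; a sketch on synthetic data (the data here is illustrative, not the bank set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; sweep odd k values and keep the best scorer
X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for k in (3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = knn.score(X_te, y_te)
print(scores, '-> best k =', max(scores, key=scores.get))
```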
# loading library
from sklearn.neighbors import KNeighborsClassifier
# instantiate learning model (k = 3)
knn = KNeighborsClassifier(n_neighbors = 3)
# fitting the model
knn.fit(X_train, y_train)
# predict the response
y_pred = knn.predict(X_test)
print('\033[91m' + "KNN Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))
print('\033[91m' + "KNN Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
knn_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])
knn_df_cm = pd.DataFrame(knn_mcm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
knn_df_cm
KNN_roc_auc = roc_auc_score(y_test,knn.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, knn.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='KNN Classifier (area = %0.2f)' % KNN_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('KNN_ROC')
plt.show()
from sklearn.tree import DecisionTreeClassifier
#from sklearn.feature_extraction.text import CountVectorizer #DT does not take strings as input for the model fit step....
from IPython.display import Image
#import pydotplus as pydot
from sklearn import tree
from os import system
from graphviz import Source
dTree = DecisionTreeClassifier(criterion = 'entropy')
dTree.fit(X_train, y_train)
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
Decision trees are prone to overfitting, as in this case. Hence we prune the tree; let us see that further under regularization.
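The overfitting gap can be reproduced on synthetic data: an unrestricted tree memorises the training set, while a depth-limited one generalises better (an illustrative sketch, not the bank data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Label noise (flip_y) makes the unrestricted tree memorise spurious patterns
X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(criterion='entropy', max_depth=7, random_state=0).fit(X_tr, y_tr)
print(full.score(X_tr, y_tr), full.score(X_te, y_te))      # 1.0 train, lower test
print(pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))  # typically a smaller gap
```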
train_char_label = ['0', '1'] # class_names are applied in ascending class order
Credit_Tree_File = open('credit_tree.dot','w')
dot_data = tree.export_graphviz(dTree, out_file=Credit_Tree_File, feature_names = list(X_train), class_names = list(train_char_label))
Credit_Tree_File.close()
retCode = system("dot -Tpng credit_tree.dot -o credit_tree.png")
if(retCode>0):
    print("system command returning error: "+str(retCode))
else:
    display(Image("credit_tree.png"))
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(pd.DataFrame(dTree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values("Imp", ascending= False))
y_pred = dTree.predict(X_test)
print('\033[91m' + "Decision Tree Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))
print('\033[91m' + "Decision Tree Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
dt_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])
dt_df_cm = pd.DataFrame(dt_mcm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
dt_df_cm
ROC/AUC Curve
DT_roc_auc = roc_auc_score(y_test,dTree.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, dTree.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='DTree Classifier (area = %0.2f)' % DT_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('DT_ROC')
plt.show()
Let us restrict the depth to 7 and rerun the decision tree.
reg_dt_model = DecisionTreeClassifier(criterion = 'entropy', max_depth = 7)
reg_dt_model.fit(X_train, y_train)
print(reg_dt_model.score(X_train, y_train))
print(reg_dt_model.score(X_test, y_test))
train_char_label = ['0', '1'] # class_names are applied in ascending class order
reg_dt_model_File = open('reg_dt_model.dot','w')
dot_data = tree.export_graphviz(reg_dt_model, out_file=reg_dt_model_File, feature_names = list(X_train), class_names = list(train_char_label))
reg_dt_model_File.close()
retCode = system("dot -Tpng reg_dt_model.dot -o reg_dt_model_File.png")
if(retCode>0):
    print("system command returning error: "+str(retCode))
else:
    display(Image("reg_dt_model_File.png"))
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(pd.DataFrame(reg_dt_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values("Imp", ascending= False))
y_pred = reg_dt_model.predict(X_test)
print('\033[91m' + "Regularized Decision Tree Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))
print('\033[91m' + "Regularized Decision Tree Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
reg_dt_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])
reg_dt_df_cm = pd.DataFrame(reg_dt_mcm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
reg_dt_df_cm
DT_roc_auc = roc_auc_score(y_test,reg_dt_model.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, reg_dt_model.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Regularized DTree Classifier (area = %0.2f)' % DT_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Reg_DT_ROC')
plt.show()
Bagging with the decision tree as base model
from sklearn.ensemble import BaggingClassifier
bgcl = BaggingClassifier(base_estimator=dTree, n_estimators=50)
#bgcl = BaggingClassifier(n_estimators=50)
bgcl = bgcl.fit(X_train, y_train)
print(bgcl.score(X_train, y_train))
print(bgcl.score(X_test, y_test))
y_pred = bgcl.predict(X_test)
print('\033[91m' + "Bagging Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))
print('\033[91m' + "Bagging Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
bagg_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])
bagg_mcm_df_cm = pd.DataFrame(bagg_mcm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
bagg_mcm_df_cm
bgcl_roc_auc = roc_auc_score(y_test,bgcl.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, bgcl.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Bagging Classifier (area = %0.2f)' % bgcl_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('bgcl_ROC')
plt.show()
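Bagging draws each base tree's training set by bootstrap resampling (sampling n rows with replacement), so each tree sees roughly 63% of the unique rows on average; a quick numpy sketch:

```python
import numpy as np

# One bootstrap sample of n indices drawn with replacement
rng = np.random.default_rng(0)
n = 1000
idx = rng.integers(0, n, size=n)

# Fraction of distinct rows in the sample: about 1 - 1/e, i.e. ~0.632
print(len(np.unique(idx)) / n)
```

The rows left out of each sample (the "out-of-bag" rows) are what give bagging its built-in validation set.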
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier(base_estimator=dTree, n_estimators=10)
#abcl = AdaBoostClassifier( n_estimators=50)
abcl = abcl.fit(X_train, y_train)
print(abcl.score(X_train, y_train))
print(abcl.score(X_test, y_test))
y_pred = abcl.predict(X_test)
print('\033[91m' + "Adaboost Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))
print('\033[91m' + "Adaboost Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
abcl_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])
abcl_mcm_df_cm = pd.DataFrame(abcl_mcm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
abcl_mcm_df_cm
abcl_roc_auc = roc_auc_score(y_test,abcl.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, abcl.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Adaboost Classifier (area = %0.2f)' % abcl_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('abcl_ROC')
plt.show()
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50)
gbcl = gbcl.fit(X_train, y_train)
print(gbcl.score(X_train, y_train))
print(gbcl.score(X_test, y_test))
y_pred = gbcl.predict(X_test)
print('\033[91m' + "Gradient Boost Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))
print('\033[91m' + "Gradient Boost Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
gbcl_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])
gbcl_mcm_df_cm = pd.DataFrame(gbcl_mcm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
gbcl_mcm_df_cm
GB_roc_auc = roc_auc_score(y_test,gbcl.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, gbcl.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Gradient Boost Classifier (area = %0.2f)' % GB_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('GB_ROC')
plt.show()
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50)
rfcl = rfcl.fit(X_train, y_train)
print(rfcl.score(X_train, y_train))
print(rfcl.score(X_test, y_test))
y_pred = rfcl.predict(X_test)
print('\033[91m' + "Random Forest Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))
print('\033[91m' + "Random Forest Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
rfcl_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])
rfcl_mcm_df_cm = pd.DataFrame(rfcl_mcm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
rfcl_mcm_df_cm
RF_roc_auc = roc_auc_score(y_test,rfcl.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, rfcl.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Random Forest Classifier (area = %0.2f)' % RF_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('RF_ROC')
plt.show()
The dataset is highly imbalanced and has very weak relationships between features. As we restricted ourselves to ensemble methods, none of the weakly correlated features were excluded; this can be revisited with feature selection via PCA/LDA techniques. The recall values are still low, but we obtained better results with ensemble learning.
Based on the ROC/AUC curves, below is the order of model performance by false negatives and recall:
AdaBoost with decision tree (recall = 0.33) > Bagging with decision tree (recall = 0.25) > Regularized decision tree with depth 7 (recall = 0.18)
However, there is still scope to improve this with the help of feature selection.
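The ranking above can be kept in a small summary frame for reporting (using the recall values quoted):

```python
import pandas as pd

# Recall values quoted above for the three best models
summary = pd.DataFrame({
    'model': ['AdaBoost (DT base)', 'Bagging (DT base)', 'Regularized DT (depth 7)'],
    'recall': [0.33, 0.25, 0.18],
}).sort_values('recall', ascending=False)
print(summary.to_string(index=False))
```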